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We consider the task of computing solutions of linear systems that only differ by a shift with 
Ph the identity matrix as well as linear systems with several different right hand sides. In the past 

Krylov subspace methods have been developed which exploit either the need for solutions to 
multiple right hand sides (e.g. deflation type methods and block methods) or multiple shifts (e.g. 
shifted CG) with some success. In this paper we present a block Krylov subspace method which, 

>- 

q>^ based on a block Lanczos process, exploits both features — shifts and multiple right hand sides — 

at once. Such situations arise, for example, in lattice QCD simulations within the Rational Hybrid 
Monte Carlo algorithm. We give numerical evidence that our method is superior to applying other 
iterative methods to each of the systems individually as well as, in some cases, to shifted or block 

<0 Krylov subspace methods. 
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1. Introduction 



In this paper we consider solving systems of linear equations of the form 



{OjI+A) XiJ 



(1.1) 



where Ojl +A G C' IX " is a hermitian positive definite (hpd) matrix for every shift Oj £ C, x,j,bi £ 
C", i = 1 , . . . ,m and j = 1 , . . . , s. 

Systems like these arise naturally in lattice QCD, where the bi represent multiple source terms, 
and A is a discrete version of the Dirac operator like, e.g. the Wilson fermion matrix. For exam- 
ple the Rational Hybrid Monte Carlo algorithm [8] needs to solve A v Xj = bj with v 6 (—1,1) for 
multiple right hand sides (rhs) bj. The matrix power A v therein is computed using a rational ap- 
proximation which can be represented as a partial fraction expansion thus giving rise to multiple 
shifted systems. Another application is the computation of symplectic integrators that require the 
evaluation of A v BA v Xj = bi for multiple random right hand sides. In that case the inner application 
of A v is approximated via a partial fraction expansion that generates multiple right hand sides for 
the outer application even if there was only one right hand side b\. 

By arranging the m right hand sides and the corresponding solutions in the matrices 



In the past Krylov subspace methods have been developed which exploit the composition of 
multiple right hand sides to blocks and solve each of the j systems in (1.2) in one go [9, 4]. It was 
realised that in these so called block methods deflation, i.e. removing those vectors that become 
linearly dependent, is important in order to guarantee convergence[7]. Other algorithms make use 
of multiple shifts for systems with a single right hand side and compute solutions for every shifted 
system by spanning the Krylov subspace just once [6]. For non-hermitian matrices there even exists 
a method that combines multiple right hand sides-including deflation-and shifted systems [5]. All 
these methods improved the computational costs for solving the kind of problem they are focused 
on. 

We here present a new, deflated shifted block CG (DSBlockCG) method that merges every 
aspect: the block idea, the idea of computing solutions for the shifted systems alongside one seed 
system and the focus on hpd matrices. It is based on a block Lanczos process which is the restriction 
to the hermitian case of the non-symmetric block Lanczos process from [1]. This process is capable 
of handling multiple starting vectors and includes deflation. 

2. Block Krylov subspace methods 

For ease of presentation we first disregard the shifts, i.e. we deal with systems 



B = \b\ | • • • \b m ] and Xj = [xij 




we can rewrite the systems (1.1) as 



(GjI+A)Xj=B. 



(1.2) 



Axi = bj,i= 1,..., 



m. 
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We define the k-th block Krylov subspace with respect to A and B = [bi\ ■ ■ ■ \b m ] as 



K k (A,B) = span jfci,..., b m ,Ab u . . . ,Ab m ,A 2 b u . . . 
=: colspan [ B \ AB \ A 2 B \ ■■■ \ A k ~ l B . 




(2.1) 



Clearly, all the Krylov subspaces K k (A,bj) := span {bi,Abi, . . . ,A k ~ l bj} are contained in K k (A,B). 
While the dimension of each space K k (A,bi) is k (unless we have reached an invariant subspace), 
the dimension of the block subspace can be smaller than mk. This is due to the fact that some of the 
subspaces K k (A,bi) can have non-trivial intersections even when all the b[ are linearly independent. 

The block conjugate gradient methods in [9] and [4] create block iterates = [xj | . . . \xm] 
with Xj S K k (A,B) and advance in each step from K k (A,B) to K k+ \(A,B). The iterates are 
obtained such that they satisfy the Galerkin condition 



which in the non-block case reduces to the classical variational characterisation of the CG iterates. 
Let denote a matrix whose columns form a basis of Kk(A,B). Then the Galerkin condition is 
equivalent to 



3. The deflated shifted block CG method 

Based on the block Lanczos process[l] we can now explain the DSBlockCG method. We start 
by considering the unshifted block system AX = B. Handling additional shifts will be addressed at 
the end of this section. For details on the derivation and implementation of the algorithm we refer 
to our upcoming paper [2]. 

The block Lanczos process generates matrices and V^ k > . The orthonormal columns of the 
latter form a basis of a ^-dimensional subspace which we will call the k-th deflated block Krylov 
subspace K^ efi (A,B). Additionally the relation 



holds and is a hermitian, banded matrix with semi-bandwidth m. Note, that the dimension 
k of K^ ea (A,B) differs from the dimension of the non-deflated subspace in (2.1) which 

is at most mk, but might be less if some vectors spanning Kk{A,B) are linearly dependent. As 
soon as such deflation occurs, the bandwidth of the trailing right lower submatrix of T^' decreases 
accordingly. 

A deflated block Krylov subspace method generates iterates = [x^\ . . . \x$], s.t. Xj € 
K^ en (A,B). Building upon (2.2), but using the orthogonal basis of the deflated block Krylov sub- 
space and (3.1), we obtain the iterates X^ k > as 



X® G K k (A,B) and7? w =B-AX {k) ±K k (A,B) 




(2.2) 



T (k) = (y{k)f A y(k) m 



(3.1) 



x (k) := y«( r «)-i(y«)ff B . 



(3.2) 
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In order to obtain a feasible iterative method, cheap updates for the iterates have to be obtained. 
By computing a root- free Cholesky decomposition LWDW (LW ) H of the matrix T^> we can update 
the iterates via 

x (k) _ y(*)( L m D (*)( L (*)\#)-l(y(*)\H B 

where dW is the (&,fc)-entry in DW, g C" and G C lxm . The important property for the 
feasibility of our method is that the Cholesky decomposition as well as pW and can be updated 
with short recurrences of length m or even less if deflation occurred. 

In order to be able to stop the iteration we should be able to compute the norms of the residuals 

(k) (k) 

r, = bj — Axj , 1 < j < m. Fortunately, although the residuals are not directly available, their 
norms can be computed at very low additional cost. The iterates 1^ are generated in such a way, 
that the residuals /?W are orthogonal to the Lanczos vectors for j < i — m. They may thus be 
written as 

R (k) = w (k) c (k) 

where G C mxm and = [v^| . . . |v( i '+ w - 1 )] G C nxm . Thus, if we are only interested in the 
norm of each of the residuals we can just compute the norm of the columns of CW g C mxm , because 

has orthonormal columns. Indeed, there is a computationally cheap way to compute using 
only the matrix and its Cholesky decomposition. If deflation occurred the number of columns 
of and rows even decreases. 

The block method so far can be extended to handle shifted systems and multiple right hand 
sides at the same time. For ease of notation we focus on a situation where we have just one 
additional shift a and use the notation A a := al+A. Matrices and vectors belonging to the shifted 
system will also be noted by the index a. Krylov subspaces are shift invariant, i.e. Kk{A,b) = 
Kk{A a ,b) for all k, see [10]. Moreover, starting with the initial vector b, the Lanczos process 
produces exactly the same vectors, whether we take A or A a . This property immediately carries 
over to the block Lanczos process and to the deflated block Lanczos process. We can make use of 

{V^) H {aI n +A)V^ = al k + T« =: t£\ 

so that we do not have to compute VW and for each shifted system individually. Instead 
we can reuse these from the unshifted system AX = B and only have to compute the Cholesky 
decomposition and d£ as well as the vectors py and u£ for each shift. 

4. Numerical results 

In this section we present and discuss results of applying the algorithm from section 3 to some 
QCD test problems. We were working on non-parallelised algorithms, therefore our tests were 
restricted to rather small lattice sizes. A Chroma implementation of the algorithm is currently 
under development. 

In order to obtain fair time measurements we implemented all of the algorithms in C++ using 
the uBLAS library [3] for sparse and dense BLAS operations. We stored the Wilson Dirac operator 
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D as a sparse matrix and did not exploit any further properties like its four dimensional lattice struc- 
ture or symmetries. We chose a quenched configuration with temperature /3 =6.0. As compiler 
we used g++ in version 4.5.1. The results were produced on a Core 2 Quad Q9650 running at 3.00 
GHz. Since our programs were not parallelized they were running a single core of that machine. 

We compare our algorithm to three other algorithms. The first of them is non-preconditioned 
conjugate gradients (CG). For CG we have to solve each of the systems 

(o j I+A)x iJ = b i 

separately. The second algorithm is the shifted conjugate gradients algorithm from [6] referred to 
as shiftedCG. It has to be applied m times, once for each right hand side For each right hand side 
all s shifted systems are solved at the same time. The third and last competing algorithm is BCGrQ 
from [4]. For each shift, this block algorithm solves all the systems belonging to the various right 
hand sides in one go. 



DSBIockCG shiftedCG 




500 1000 1500 2000 2500 3000 500 1000 1500 2000 2500 3000 3500 4000 




7 shifts. Horizontal axis: time in seconds. Vertical axis: relative norm of residual. The saw tooth shape 
of the plots has two causes: a) for DSBIockCG and shiftedCG only the residual for the worst conditioned 
non-converged system is computed and plotted, b) methods that can not solve all of the systems at the same 
time need subsequent runs which are plotted one after the other. 



Figure 1 shows a generic run of all four algorithms for a 16 4 lattice to a target relative residual 
of 10~ 8 . We chose the same 4 random right hand sides for all methods. The 7 shifts were chosen, 
s.t. the smallest eigenvalue of the worst conditioned shifted system was of magnitude 10~ 6 and for 
the best conditioned system it was of magnitude 10 _1 . DSBIockCG in the top left plot shows a 
saw tooth like convergence. This is caused by just computing the residual for the best conditioned 
not yet converged system which saves some work. Every time the plot of the relative residual 
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jumps up, all the shifted systems for one right hand side have converged and the residuals for next 
non-converged system are computed. Thus, this plot shows that it is important to stop updating 
the iterates of well conditioned systems as soon as they reach the target residual. This speeds up 
the computation of the remaining systems. The top right shiftedCG plot shows that the different 
random right hand side vectors show almost the same convergence behaviour. The bottom two 
plots show clearly that CG and BCGrQ are not competitive. Table 1 gives the plain numbers for 
the same test run and shows that shiftedCG takes about 30% more time than DSBlockCG. The gap 
in the number of matrix- vector multiplications (mvms) is even bigger. As a result of Figure 1 we 
can constrain the remaining tests to comparisons of DSBlockCG and shiftedCG. 





CG BlockCG shiftedCG DSBlockCG 


mvms 
time in seconds 
relative time 


22337 13488 4883 2766 
13970.7 10993.3 3630.53 2778.82 
5.03 3.96 1.31 1 



Table 1: Number of mvms, time and relative time as compared to DSBlockCG. 



relative time DSBIockCG/siiiftedCG for 1 to 1 shifts relative number of mvms DSBIockCG/shiftedCG for 1 to 1 shifts 

2 i 1 1 1 1 1 1 1 1 

1.5 " 

1 " 

0.5" * * * * * * * * - 

- 

-O.5I 1 1 1 1 1 1 1 1 1 

123456789 10 123456789 10 

Figure 2: The left plot shows the relative time (vertical axis) of algorithm DSBlockCG over shiftedCG on 
a 16 4 lattice for a different number of shifts (horizontal axis) and a fixed number of 4 rhs. The right plot 
displays the relative number of mvms for the same run. 

Figure 2 displays the dependence on the number of shifts while the number of right hand 
sides stayed fixed at 4. The results were produced using the same 16 4 lattice configuration as in 
the previous example. We chose the first shift, s.t. the smallest eigenvalue of the shifted system 
was of magnitude 10~ 6 . We then successively increased the number of shifts, s.t. the smallest 
eigenvalues of the shifted systems were distributed in the interval [10~ 6 , 10 -1 ]. The plots show 
that the number of mvms depends only on the worst conditioned system which means that for any 
number of shifts DSBlockCG only needs about half the number of mvms compared to shiftedCG. 
The computational costs on the other hand are rising for an increased number of shifts. 

In Figure 3 we display a test run for another test configuration on a 12 4 lattice. In this case 
the number of 10 shifts stayed fixed. Like in the previous test cases the shifts were chosen, s.t. 
the smallest eigenvalue of the worst conditioned shifted system was of magnitude 10~ 6 and for the 
best conditioned system it was of magnitude 10 _1 . There is a sweet spot at 4 shifts after which 
the additional costs start to become dominant and DSBlockCG becomes slower than shiftedCG, 
eventually. 
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relative time DSBIockCG/shiftedCG for 1 to 9 right hand sides relative number of mvms DSBIockCG/shiftedCG for 1 to 9 right hand sides 




123456789 123456789 



Figure 3: The left plot shows the relative time (vertical axis) of algorithm DSBlockCG over shiftedCG on 
a 12 4 lattice for a different number of right hand sides (horizontal axis) and fixed number of 10 shifts. The 
right plot displays the relative number of mvms for the same run. 

5. Conclusions 

We proposed a new iterative method for the solution of systems of linear equations of the form 
{Gjl +A)xu = based on a block Lanczos-type process. This method was shown to converge 
faster than just applying CG to every system or applying block CG methods to systems belonging 
to a single shift. For reasonable numbers of right hand sides and shifts our new method even proved 
to be faster that shiftedCG. 
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